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Summary. This chapter describes componentwise Least Squares Support Vector Machines 
(LS-SVMs) for the estimation of additive models consisting of a sum of nonlinear compo- 
nents. The primal-dual derivations characterizing LS-SVMs for the estimation of the additive 
model result in a single set of linear equations with size growing in the number of data-points. 
The derivation is elaborated for the classification as well as the regression case. Furthermore, 
different techniques are proposed to discover structure in the data by looking for sparse com- 
ponents in the model based on dedicated regularization schemes on the one hand and fusion 
of the componentwise LS-SVMs training with a validation criterion on the other hand. 
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1 Introduction 

Non-linear classification and function approximation is an important topic of in- 
terest with continuously growing research areas. Estimation techniques based on 
regularization and kernel methods play an important role. We mention in this con- 
text smoothing splines (Wahba, 1990), regularization networks (Poggio and Girosi, 
1990), Gaussian processes (MacKay, 1992), Support Vector Machines (S VMs) (Vapnik, 
1998; Cristianini and Shawe-Taylor, 2000; Schoelkopf and Smola, 2002) and many 
more, see e.g. (Hastie et al, 2001). SVMs and related methods have been introduced 
within the context of statistical learning theory and structural risk minimization. 
In the methods one solves convex optimization problems, typically quadratic pro- 
grams. Least Squares Support Vector Machines (LS-SVMs) 3 (Suykens and Vande- 
walle, 1999; Suykens et al., 2002) are reformulations to standard SVMs which lead to 
solving linear KKT systems for classification tasks as well as regression. In (Suykens 
et al., 2002) LS-SVMs have been proposed as a class of kernel machines with primal- 
dual formulations in relation to kernel Fisher Discriminant Analysis (FDA), Ridge 



3 http://www.esat.kuleuven.ac.be/sista/lssvmlab 



Regression (RR), Partial Least Squares (PLS), Principal Component Analysis (PCA), 
Canonical Correlation Analysis (CCA), recurrent networks and control. The dual 
problems for the static regression without bias term are closely related to Gaussian 
processes (MacKay, 1992), regularization networks (Poggio and Girosi, 1990) and 
Kriging (Cressie, 1993), while LS-SVMs rather take an optimization approach with 
primal-dual formulations which have been exploited towards large scale problems 
and in developing robust versions. 

Direct estimation of high dimensional nonlinear functions using a non-parametric 
technique without imposing restrictions faces the problem of the curse of dimension- 
ality. Several attempts were made to overcome this obstacle, including projection 
pursuit regression (Friedmann and Stuetzle, 1981) and kernel methods for dimen- 
sionality reduction (KDR) (Fukumizu et ai, 2004). Additive models are very use- 
ful for approximating high dimensional nonlinear functions (Stone, 1985; Hastie 
and Tibshirani, 1990). These methods and their extensions have become one of 
the widely used nonparametric techniques as they offer a compromise between the 
somewhat conflicting requirements of flexibility, dimensionality and interpretability. 
Traditionally, splines are a common modeling technique (Wahba, 1990) for addi- 
tive models as e.g. in MARS (see e.g. (Hastie et ai, 2001)) or in combination with 
ANOVA (Neter et ai, 1974). Additive models were brought further to the attention of 
the machine learning community by e.g. (Vapnik, 1998; Gunn and Kandola, 2002). 
Estimation of the nonlinear components of an additive model is usually performed 
by the iterative backfitting algorithm (Hastie and Tibshirani, 1990) or a two-stage 
marginal integration based estimator (Linton and Nielsen, 1995). Although consis- 
tency of both is shown under certain conditions, important practical problems (num- 
ber of iteration steps in the former) and more theoretical problems (the pilot estimator 
needed for the latter procedure is a too generally posed problem) are still left. 

In this chapter we show how the primal-dual derivations characterizing LS-SVMs 
can be employed to formulate a straightforward solution to the estimation problem 
of additive models using convex optimization techniques for classification as well 
as regression problems. Apart from this one-shot optimal training algorithm, the 
chapter approaches the problem of structure detection in additive models (Hastie 
et ah, 2001; Gunn and Kandola, 2002) by considering an appropriate regularization 
scheme leading to sparse components. The additive regularization (AReg) frame- 
work (Pelckmans et ai, 2003) is adopted to emulate effectively these schemes based 
on 2-norms, 1-norms and specialized penalization terms (Antoniadis and Fan, 2001). 
Furthermore, a validation criterion is considered to select relevant components. Clas- 
sically, exhaustive search methods (or stepwise procedures) are used which can be 
written as a combinatorial optimization problem. This chapter proposes a convex 
relaxation to the component selection problem. 

This chapter is organized as follows. Section 2 presents componentwise LS-S VM 
regressors and classifiers for efficient estimation of additive models and relates the 
result with ANOVA kernels and classical estimation procedures. Section 3 introduces 
the additive regularization in this context and shows how to emulate dedicated reg- 
ularization schemes in order to obtain sparse components. Section 4 considers the 



problem of component selection based on a validation criterion. Section 5 presents a 
number of examples. 

2 Componentwise LS-SVMs and Primal-Dual Formulations 

2.1 The Additive Model Class 

Giving a training set defined as T>m — { x k,Uk}k=i c ^ D x M of size N drawn 
i.i.d. from an unknown distribution Fxy according to y k = f(xk) + &k where 
/ : R D — > R is an unknown real-valued smooth function, E[yk\X = Xk] = 
f(xk) and ei, . . . , ejv are uncorrected random errors with E [ek\X = Xk] = 0, 
E [(ek) 2 \X — Xk] — o\ < oo. The n data points of the validation set are de- 
noted as T>n^ = {x^ v \y^}" =1 . The following vector notations are used through- 
out the text: X = (x u ...,x N ) £ R DxN , Y = . . . , y N f G R N , = 

(x^, x { n ] ^j G R Dxn and = (y[ v) , y^ G K". The estimator of a 
regression function is difficult if the dimension D is large. One way to quantify this 
is the optimal minimax rate of convergence N~ 2l ^ 2l+D ^ for the estimation of an I 
times differentiable regression function which converges to zero slowly if D is large 
compared to I (Stone, 1982). A possibility to overcome the curse of dimensionality is 
to impose additional structure on the regression function. Although not needed in the 
derivation of the optimal solution, the input variables are assumed to be uncorrected 
(see also concurvity (Hastie and Tibshirani, 1990)) in the applications. 

Let superscript x d G R denote the d-th component of an input vector x G R D 
for all d = 1,...,D. Let for instance each component correspond with a different 
dimension of the input observations. Assume that the function / can be approximated 
arbitrarily well by a model having the following structure 

D 

f(x)=J2-f d ( xd ') + b > W 
d=i 

where f d : R — > R for all d = 1, . . . , D are unknown real-valued smooth func- 
tions and b is an intercept term. The following vector notation is used: X d = 

(xf,...,x d N ) e R lxN and X^ d = (x { ? )d , . . . , x { n )d ^j G R lxn . The optimal 

rate of convergence for estimators based on this model is N~ 2l l ( 2(+1 ) which is inde- 
pendent of D (Stone, 1985). Most state-of-the-art estimation techniques for additive 
models can be divided into two approaches (Hastie et al, 2001): 

• Iterative approaches use an iteration where in each step part of the unknown 
components are fixed while optimizing the remaining components. This is moti- 
vated as: 

f dl (xt)=y k -e k - f d2 K 2 )> ( 2 ) 



for all A: = 1,...,N and d± = 1,...,D. Once the N — 1 components of the 
second term are known, it becomes easy to estimate the lefthandside. For a large 
class of linear smoothers, such so-called backfitting algorithms are equivalent to 
a Gauss-Seidel algorithm for solving a big (ND x ND) set of linear equations 
(Hastie et al, 2001). The backfitting algorithm (Hastie and Tibshirani, 1990) is 
theoretically and practically well motivated. 
• Two-stages marginalization approaches construct in the first stage a general 
black-box pilot estimator (as e.g. a Nadaraya- Watson kernel estimator) and fi- 
nally estimate the additive components by marginalizing (integrating out) for 
each component the variation of the remaining components. 



2.2 Componentwise Least Squares Support Vector Machine Regressors 

At first, a primal-dual formulation is derived for componentwise LS-S VM regressors. 
The global model takes the form as in (1) for any x* G R D 

D D 

f(x.;w d , b) = J2 f d {<: w d ) + b = J2 Wd T M4) + b- (3) 

d=l d=l 

The individual components of an additive model based on LS-SVMs are written 
as f d (x d ;w d ) — wjip d (x d ) in the primal space where ip d : E — > M. n *d denotes 
a potentially infinite (n iPd = oo) dimensional feature map. The regularized least 
squares cost function is given as (Suykens et al, 2002) 

^ D N 

min Jj(w d , e) = - V w d T w d + J e l 

d =l k=l 
D 

s.t. J2 w d Tl Pd(xi) + b + e k =y k , k=l,...,N. (4) 

Note that the regularization constant 7 appears here as in classical Tikhonov regular- 
ization (Tikhonov and Arsenin, 1977). The Lagrangian of the constraint optimization 
problem becomes 

^ D N N D 

£ 7 (w d ,6, e k ;a k ) = - ^w d T w d +^ 1 ^el~^a k (^w d T ip d {xf)+b+e k -y k ) . 

d=l fe=l fc=l d=l 

(5) 

By taking the conditions for optimality d£j/da k = 0, dC 7 /db = 0, dC 7 /de k = 
wddC 1 /dw d = and application of the kernel trick K d (xf,, xj) = ip d (xf) T ip d (x d ) 
with a positive definite (Mercer) kernel K d : R x M — > M, one gets the following 
conditions for optimality 

Vk = Ed=i w d T ip d (xf) + b + e k , k = 1, . . . , N (a) 
ekl = a k ^ k = 1, . . . , N (b) 

w d = EfeLi a k <fd(x d k ) d= 1,...,D (c) 



Note that condition (6.b) states that the elements of the solution vector a should be 
proportional to the errors. The dual problem is summarized in matrix notation as 
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where Q e M. NxN with Q = J2d=i Qd and n ti = K<l (xt x i) for a11 M = 
1, . . . , N, which is expressed in the dual variables a instead of w. A new point x* 6 
R D can be evaluated as 



D 



y* = f d (x* ;a,b) = J2 & *J2 ^(^*) + b, 



(8) 



k=l d=l 



where a and b is the solution to (7). Simulating a validation datapoint xj for all 
j = 1, . . . , n by the d-th individual component 



(9) 



k=i 



which can be summarized as follows: Y = (yi, . . . , j/at) s M , Y d — (yf , 

6 K w , = and y« = . . . , y r ( : )d ) T £ R«. 

Remarks: 

• Note that the componentwise LS-SVM regressor can be written as a linear 
smoothing matrix (Suykens et al, 2002): 



y = 5 7 y. 



(10) 



For notational convenience, the bias term is omitted from this description. The 
smoother matrix S~ £ M. NxN becomes 



5 7 = q [ n + I N 



(11) 



• The set of linear equations (7) corresponds with a classical LS-SVM regressor 
where a modified kernel is used 



K{x k , Xj ) = Y^K d {x d k ,xj). 



(12) 



d=i 



Figure 1 shows the modified kernel in case a one dimensional Radial Basis Func- 
tion (RBF) kernel is used for all D (in the example, D — 2) components. This 
observation implies that componentwise LS-SVMs inherit results obtained for 
classical LS-SVMs and kernel methods in general. From a practical point of 
view, the previous kernels (and a fortiori componentwise kernel models) result 



in the same algorithms as considered in the ANOVA kernel decompositions as in 
(Vapnik, 1998; Gunn and Kandola, 2002). 

K(x k , Xj ) = f2K d (xix«)+ J2 KM((x£,xt>) T ,(x*\x?) T )+..., 

d.— l di^d,2 

(13) 

where the componentwise LS-SVMs only consider the first term in this expan- 
sion. The described derivation as such bridges the gap between the estimation of 
additive models and the use of ANOVA kernels. 




Fig. 1. The two dimensional componentwise Radial Basis Function (RBF) kernel for compo- 
nentwise LS-SVMs takes the form K(x k , xi) = K 1 {x\, x]) + K 2 (x\, x 2 ) asdisplayed. The 
standard RBF kernel takes the form K(xk,xi) = exp(— \\xk — xi\\ 2 2 /<j 2 ) with a G Rq an 
appropriately chosen bandwidth. 



2.3 Componentwise Least Squares Support Vector Machine Classifiers 

In the case of classification, let yk,y^ £ {— 1, 1} for all fc = 1,...,N and 
j = 1, . . . , n. The analogous derivation of the componentwise LS-SVM classifier 
is briefly reviewed. The following model is considered for modeling the data 



/(x)=sign (j2f d (x d ) + b^j , 



(14) 



where again the individual components of the additive model based on LS-SVMs 
are given as f d (x d ) = Wd T <Pd(x d ) in the primal space where ipd ■ R — > K" Pd 
denotes a potentially infinite (n Vd = oo) dimensional feature map. The regularized 
least squares cost function is given as (Suykens and Vandewalle, 1999; Suykens et 
al, 2002) 



mm 

Wd,b,ek 



4 



Jj(w d , e) = ^^2 w d Tw d + ! X/ 

d=l fc=l 

s.t. y fc + =l-e fc , fc=l,...,JV, (15) 

where e k are so-called slack- variables for all /c = 1, . . . , N. After construction of the 
Lagrangian and taking the conditions for optimality, one obtains the following set of 
linear equations (see e.g. (Suykens et al, 2002)): 
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(16) 



where fi y e K ivxJV with fi y 



New data points G R D can be evaluated as 



( N D A 

= sign ^ a kVk K d {x d kl xi) + b 
\fe=i d=i / 



(17) 



In the remainder of this text, only the regression case is considered. The classification 
case can be derived straightforwardly along the lines. 



3 Regularizing for Sparse Components via Additive 
Regularization 

A regularization method fixes a priori the answer to the ill-conditioned (or ill- 
defined) nature of the inverse problem. The classical Tikhonov regularization scheme 
(Tikhonov and Arsenin, 1977) states the answer in terms of the norm of the solu- 
tion. The formulation of the additive regularization (AReg) framework (Pelckmans 
et al, 2003) made it possible to impose alternative answers to the ill-conditioning of 
the problem at hand. We refer to this AReg level as substrate LS-SVMs. An appropri- 
ate regularization scheme for additive models is to favor solutions using the smallest 
number of components to explain the data as much as possible. In this paper, we use 
the somewhat relaxed condition of sparse components to select appropriate compo- 
nents instead of the more general problem of input (or component) selection. 



3.1 Level 1: Componentwise LS-SVM Substrate 
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Fig. 2. Graphical representation of the additive regularization framework used for emulat- 
ing other loss functions and regularization schemes. Conceptually, one differentiates between 
the newly specified cost function and the LS-SVM substrate, while computationally both are 
computed simultanously. 



Using the Additive Regularization (AReg) scheme for componentwise LS-SVM 
regressors results into the following modified cost function: 



1 ° 1 N 

min Jc(w d , e) = -V w d T w d + - Y7e fe - c k ) 



Wd,b,ek 



d=l 



fe=l 



D 



s.t. J2 w d T <Pd(4) + b + e k = y k , k=l,...,N, (18) 



d=l 



where Ck G M for all k = 1,...,N. Let c = (ci , . . . , cjv) G R . After constructing 
the Lagrangian and taking the conditions for optimality, one obtains the following set 
of linear equations, see (Pelckmans et al, 2003): 
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(19) 



and e 
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. Given a regularization constant vector c, the unique solution 



follows immediately from this set of linear equations. 

However, as this scheme is too general for practical implementation, c should be 
limited in an appropriate way by imposing for example constraints corresponding 
with certain model assumptions or a specified cost function. Consider for a moment 



the conditions for optimality of the componentwise LS-S VM regressor using a regu- 
larization term as in ridge regression, one can see that equation (7) corresponds with 
(19) if 7 _1 a = a + c for given 7. Once an appropriate c is found which satisfies 
the constraints, it can be plugged in into the LS-SVM substrate (19). It turns out that 
one can omit this conceptual second stage in the computations by elimination of the 
variable c in the constrained optimization problem (see Figure 2). 

Alternatively, a measure corresponding with a (penalized) cost function can be 
used which fulfills the role of model selection in a broad sense. A variety of such 
explicit or implicit limitations can be emulated based on different criteria (see Figure 
3). 



Validation set \ Training set 



Level 2: 




Fig. 3. The level 2 cost functions of Figure 2 on the conceptual level can take different forms 
based on validation performance or trainings error. While some will result in convex tuning 
procedures, other may loose this property depending on the chosen cost function on the second 
level. 



3.2 Level 2: Emulating an L± based Component Regularization Scheme 

(Convex) 

We now study how to obtain sparse components by considering a dedicated regular- 
ization scheme. The LS-SVM substrate technique is used to emulate the proposed 
scheme as primal-dual derivations (see e.g. Subsection 2.2) are not straightforward 
anymore. 

Let Y d E l w denote the estimated training outputs of the d-th submodel f d 
as in (9). The component based regularization scheme can be translated as the fol- 
lowing constrained optimization problem where the conditions for optimality (18) as 
summarized in (19) are to be satisfied exactly (after elimination of w) 



min J 6 (Y 

c,Y d ,ek;ot.b 




(20) 



where the use of the robust L\ norm can be justified as in general no assumptions 
are imposed on the distribution of the elements of Y d . By elimination of c using the 
equality e = a + c, this problem can be written as follows 



min J^Y^ek) 

Y d ,e k ;a,b 
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(21) 



This convex constrained optimization problem can be solved as a quadratic program- 
ming problem. As a consequence of the use of the L\ norm, often sparse components 
(||F d ||i = 0) are obtained, in a similar way as sparse variables of LASSO or sparse 
datapoints in SVM (Hastie et al, 2001; Vapnik, 1998). An important difference is 
that the estimated outputs are used for regularization purposes instead of the solution 
vector. It is good practice to omit sparse components on the training dataset from 
simulation: 



N 



=1 des D 



(22) 



where S D = {d\a T fl d a ^ 0}. 

Using the L 2 norm J2d=i ll^lli mstea d leads to a much simpler optimization 
problem, but additional assumptions (Gaussianity) are needed on the distribution of 
the elements of Y d . Moreover, the component selection has to resort on a significance 
test instead of the sparsity resulting from (21). A practical algorithm is proposed in 
Subsection 5.1 that uses an iteration of L 2 norm based optimizations in order to 
calculate the optimum of the proposed regularized cost function. 



3.3 Level 2 bis: Emulating a Smoothly Thresholding Penalty Function 

(Non-convex) 

This subsection considers extensions to classical formulations towards the use of 
dedicated regularization schemes for sparsifying components. Consider the compo- 
nentwise regularized least squares cost function defined as 



A ° 1 N 

J x ( Wd ,e) = -Y,t(w d ) + -Y,4, (23) 

d=l k=l 

where ^(wa) is a penalty function and A 6 M + acts as a regularization parameter. 
We denote A^(-) by £\(-), so it may depend on A. Examples of penalty functions 
include: 

• The L p penalty function £^(wd) — X\\wd\\p leads to a bridge regression (Frank 
and Friedman, 1993; Fu, 1998). It is known that the L 2 penalty function p = 2 
results in the ridge regression. For the L\ penalty function the solution is the 
soft thresholding rule (Donoho and Johnstone, 1994). LASSO, as proposed by 
(Tibshirani, 1996; Tibshirani, 1997), is the penalized least squares estimate using 
the L\ penalty function (see Figure 4. a). 

• Let the indicator function I{ x eA} = 1 if £ £ A for a specified set A and 
otherwise. When the penalty function is given by £\(wd) = A 2 — — 
A) 2 /{|| M , ti || 1< A} (see Figure 4.b), the solution is a hard-thresholding rule (Antoniadis, 
1997). 

The L p and the hard thresholding penalty functions do not simultaneously satisfy 
the mathematical conditions for unbiasedness, sparsity and continuity (Fan and Li, 
2001). The hard thresholding has a discontinuous cost surface. The only continuous 
cost surface (defined as the cost function associated with the solution space) with 
a thresholding rule in the i p -family is the L\ penalty function, but the resulting 
estimator is shifted by a constant A. To avoid these drawbacks, (Nikolova, 1999) 
suggests the penalty function defined as 

t\{w d ) = (24) 

with a G Kg . This penalty function behaves quite similarly as the Smoothly Clipped 
Absolute Deviation (SCAD) penalty function as suggested by (Fan, 1997). The 
Smoothly Thresholding Penalty (TTP) function (24) improves the properties of the 
Li penalty function and the hard thresholding penalty function (see Figure 4.c), see 
(Antoniadis and Fan, 2001). The unknowns a and A act as regularization parameters. 
A plausible value for a was derived in (Nikolova, 1999; Antoniadis and Fan, 2001) 
as a = 3.7. The transformed L\ penalty function satisfies the oracle inequalities 
(Donoho and Johnstone, 1994). One can plugin the described semi-norm l^(-) to 
improve the component based regularization scheme (20). Again, the additive regu- 
larization scheme is used for the emulation of this scheme 

min Jx (Y d ,e k ) = lj2 ea ^ + llb e l 

c,Y«,e k ;a,b 2 ^ 2 ^ 



S.t. 



lJfCt = 0, 

f2a + l T N b + a + c = Y, 

H d a = Y d , Vd=l,...,D 1 ^ 

a + c = e, 
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Fig. 4. Typical penalty functions: (a) the L p penalty family for p = 2,1 and 0.6, (b) hard 
thresholding penalty function and (c) the transformed L\ penalty function. 

which becomes non-convex but can be solved using an iterative scheme as explained 
later in Subsection 5.1. 



4 Fusion of Componentwise LS-SVMs and Validation 

This section investigates how one can tune the componentwise LS-SVMs with re- 
spect to a validation criterion in order to improve the generalization performance 
of the final model. As proposed in (Pelckmans et al, 2003), fusion of training and 
validation levels can be investigated from an optimization point of view, while con- 
ceptually they are to be considered at different levels. 

4.1 Fusion of Componentwise LS-SVMs and Validation for Regularization 
Constant Tuning 

For this purpose, the fusion argument as introduced in (Pelckmans et al, 2003) is 
briefly revised in relation to regularization parameter tuning. The estimator of the 
LS-SVM regressor on the training data for a fixed value 7 is given as (4) 



which results into solving a linear set of equations (7) after substitution of w by 
Lagrange multipliers a. Tuning the regularization parameter by using a validation 
criterion gives the following estimator 



Level 1 : (w, b) = argmin J-y{w, e) s.t. (4) holds, 



(26) 




(27) 



satisfying again (4). Using the conditions for optimality (7) and eliminating w and e 



n 

Fusion : (7, a, b) — argmin^ (f(xj;a, b) — yj) 2 s.t. (7) holds, 

-y,a,b . =1 

(28) 

which is referred to as fusion. The resulting optimization problem was noted to be 
non-convex as the set of optimal solutions w (or dual a's) corresponding with a 
7 > is non-convex. To overcome this problem, a re-parameterization of the trade- 
off was proposed leading to the additive regularization scheme. At the cost of over- 
parameterizing the trade-off, convexity is obtained. To circumvent this drawback, 
different ways to restrict explicitly or implicitly the (effective) degrees of freedom 
of the regularization scheme c 6 A C M. N were proposed while retaining convexity 
((Pelckmans et ai, 2003)). The convex problem resulting from additive regulariza- 
tion is 

n 

Fusion : (c, a, b) = argmin^ (f(xj-,a, b) — yj) s.t. (19) holds, 

c£A,a,b j =1 

(29) 

and can be solved efficiently as a convex constrained optimization problem if A is a 
convex set, resulting immediately in the optimal regularization trade-off and model 
parameters (Boyd and Vandenberghe, 2004). 

4.2 Fusion for Component Selection using the Additive Regularization Scheme 

One possible relaxed version of the component selection problem goes as follows: 
Investigate whether it is plausible to drive the components on the validation set 
to zero without too large modifications on the global training solution. This is 
translated as the following cost function much in the spirit of (20). Let fi^ de- 
note J2d=i Q(v)d e K nxJV and Q [ $ d = K d {xf )d , x d k ) for all j = 1, . . . , n and 
/,• I V. 

c,r<*,r(»)<i,e,a,& z d=1 z d=1 z fe=1 

' l T N a = 
a + c — e 

s.t. } Qa + l N b + a + c = Y (30) 
fl d a = Y d , W=1,...,D, 
^ Q (v)d a = Y(v)d^ Vd=l,...,D, 

where the equality constraints consist of the conditions for optimality of (19) and 
the evaluation of the validation set on the individual components. Again, this convex 
problem can be solved as a quadratic programming problem. 



4.3 Fusion for Component Selection using Componentwise Regularized 
LS-SVMs 



We proceed by considering the following primal cost function for a fixed but strictly 
positive 77 = (t) U . . . , ri D ) T £ {R^) D 

1 D T 1 N 

t 11 • <t < \ Wd Wd 1 2 
Level 1 : mm e) = - ^ — + 2^ e * 

a— 1 k—1 

D 

si. ^2w d T <fd(x d k ) + b + e k = y k , k=l,...,N. (31) 

d=l 

Note that the regularization vector appears here similar as in the Tikhonov regular- 
ization scheme (Tikhonov and Arsenin, 1977) where each component is regularized 
individually. The Lagrangian of the constrained optimization problem with multipli- 
ers a v £ M. N becomes 

d T N 
C n (w d , b, e fe ; a k ) = - ^ — + 2 e fc 

d=l ' d fc=l 
TV _D 

-H a feE WdT ^( X fe) + 6 + e ^y fe )- (32) 
fe=l d=l 

By taking the conditions for optimality dC n /da k — 0, dC n /db = 0, dC v /de k = 
and dC v /dwd = 0, one gets the following conditions for optimality 
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(33) 



The dual problem is summarized in matrix notation by application of the kernel trick, 
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(34) 



where £ R NxN with I2 r ' = J2 d =i Vd^ d and fl d kl = K d (xf, xf). A new point 
£ R D can be evaluated as 

N D 

= />*; &) = ^>fc E VdK d (x d k ,xt) + b, (35) 
fe=i d=i 

where d and 6 are the solution to (34). Simulating a training datapoint x k for all 
fc = 1, . . . , 7Y by the d-th individual component 



N 



(36) 



1=1 



which can be summarized in a vector Y v ' d = (yf, . . . ,y N ) G R^. As in the previous 
section, the validation performance is used for tuning the regularization parameters 



Level 



2 

2 : f) — arg min ( f(xj ;a v ,b) — yj ) with (a , b) = argminj^, 



3 = 1 



(37) 



or using the conditions for optimality (34) and eliminating w and e 

n 

Fusion: (fj, a n , b) = argminy^ (f(xj;a n , b) — yj) 2 s.t. (34) holds, 

ri,a'',b . =1 

(38) 

which is a non-convex constrained optimization problem. 

Embedding this problem in the additive regularization framework will lead us to a 
more suitable representation allowing for the use of dedicated algorithms. By relating 
the conditions (19) to (34), one can view the latter within the additive regularization 
framework by imposing extra constraints on c. The bias term b is omitted from the 
remainder of this subsection for notational convenience. The first two constraints 
reflect training conditions for both schemes. As the solutions a v and a do not have 
the same meaning (at least for model evaluation purposes, see (8) and (35)), the 
appropriate c is determined here by enforcing the same estimation on the training 
data. In summary: 



{n + I N )a + c = Y 
((Eii%^)+^) ^=Y 



{Q + I N )a + c- 



= Y 



{a + c), 



(39) 

where the second set of equations is obtained by eliminating a v . The last equation 
of the righthand side represents the set of constraints of the values c for all possible 
values of i]. The product ® denotes 7] T ® In = [tjiIn, ■ • ■ , VdIn] £ M. NxND . As 
for the Tikhonov case, it is readily seen that the solution space of c with respect to 
r\ is non-convex, however, the constraint on c is recognized as a bilinear form. The 
fusion problem (38) can be written as 



Fusion : (fj, a, c) = arg min 



Q(v) a _ yW 



s.t. (39) holds, 



(40) 



where algorithms as alternating least squares can be used. 



5 Applications 



For practical applications, the following iterative approach is used for solving non- 
convex cost-functions as (25). It can also be used for the efficient solution of convex 
optimization problems which become computational heavy in the case of a large 
number of datapoints as e.g. (21). A number of classification as well as regression 
problems are employed to illustrate the capabilities of the described approach. In the 
experiments, hyper-parameters as the kernel parameter (taken to be constants over 
the components) and the regularization trade-off parameter 7 or £ were tuned using 
10-fold cross-validation. 



5.1 Weighted Graduated Non-Convexity Algorithm 

An iterative scheme was developed based on the graduated non-convexity algorithm 
as proposed in (Blake, 1989; Nikolova, 1999; Antoniadis and Fan, 2001) for the opti- 
mization of non-convex cost functions. Instead of using a local gradient (or Newton) 
step which can be quite involved, an adaptive weighting scheme is proposed: in every 
step, the relaxed cost function is optimized by using a weighted 2-norm where the 
weighting terms are chosen based on an initial guess for the global solution. For every 
symmetric loss function £(\e\) : M + — > M. + which is monotonically increasing, there 
exists a bijective transformation t : M — > K such that for every e = y — f(x; 9) e M 

lie) = (t(e)f . (41) 

The proposed algorithm for computing the solution for semi-norms employs itera- 
tively convex relaxations of the prescribed non-convex norm. It is somewhat inspired 
by the simulated annealing optimization technique for optimizing global optimiza- 
tion problems. The weighted version is based on the following derivation 



£{e k ) = (y k e k f O v k = J-^jr, (42) 

V e k 

where the e k for all k = 1, . . . , N are the residuals corresponding with the solu- 
tions to 9 = argming £ (y k — f(x k ; 9)) This is equal to the solution of the convex 
optimization problem e k = argmin^ (v k (y k — f{x k ;9))) 2 for a set of v k satisfying 
(42). For more stable results, the gradient of the penalty function £ and the quadratic 
approximation can be takne equal as follows by using an intercept parameter [i k e M 
forallfc= 1,...,N: 

( £(e k ) = (v k e k ) 2 +/j, k 
\l'{e k )=2vle k 

where £'(e k ) denotes the derivative of £ evaluated in e k such that a minimum of 
Jl also minimizes the weighted equivalent (the derivatives are equal). Note that the 
constant intercepts [i k are not relevant in the weighted optimization problem. Under 
the assumption that the two consecutive relaxations £^> and £( t+1 ) do not have too 
different global solutions, the following algorithm is a plausible practical tool: 



2e fe 
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£'(e k ) 



(43) 
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(a) (b) 

Fig. 5. (a) Weighted L2-norm (dashed) approximation (v k e k ) 2 + fx k of the Li-norm (solid) 
1(e) = |e|i which follows from the linear set of equations (43) once the optimal et areknown; 
(b) the weighting terms v k for a sequence of e k and k = 1, . . . , N such that (v k e k ) 2 + fi k = 
\e k \i and2v\e k — I' (e k ) — sign(e fe ) for an appropriate fi k . 

Algorithm 1 (Weighted Graduated Non-Convexity Algorithm) For the optimiza- 
tion of semi-norms (£(■)), a practical approach is based on deforming gradually a 
2-norm into the specific loss function of interest. Let Qbe a strictly decreasing series 
1) > , • ■ • , 0. A plausible choice for the initial convex cost function is the least 
squares cost function J\£,(e) = 1 1 e 1 1 

1. Compute the solution 9^ for Li norm J\s(e) — ||e||| with residuals ; 

2. t = 0andvW = 1 N ; 

3. Consider the following relaxed cost function J 1 ^ (e) = (1 — Ct)^(e) + CtJi,s{e)>' 

4. Estimate the solution 8^ t+1 ^ and corresponding residuals e£ +1 ^ of the cost func- 
tion 

jit) 

using the weighted approximation j7kpprox = {vf' >e k) 2 of J^\ek) 

5. Reweight the residuals using weighted approximative squares norms as derived 
in (43): 

6. t := t + 1 and iterate step (3,4,5,6) until convergence. 

When iterating this scheme, most v k will be smaller than 1 as the least squares cost 
function penalizes higher residuals (typically outliers). However, a number of resid- 
uals will have increasing weight as the least squares loss function is much lower for 
small residuals. 

5.2 Regression examples 

To illustrate the additive model estimation method, a classical example was con- 
structed as in (Hastie and Tibshirani, 1990; Vapnik, 1998). The data were gener- 
ated according to y k = 10sinc(a:J.) + 20 (x\ — 0.5) 2 + 10 x\ + 5 x\ + were 




Fig. 6. Example of a toy dataset consisting of four input components X 1 , X 2 , X 3 and X 4 
where only the first one is relevant to predict the output f(x) = sinc^ 1 ). A componentwise 
LS-SVM regressor (dashed line) has good prediction performance, while the L\ penalized cost 
function of Subsection (3.2) also recovers the structure in the data as the estimated components 
correspnding with X 2 , X 3 and X 4 are sparse. 



ek ~ A/"(0, 1), N = 100 and the input data X are randomly chosen from the interval 
[0, l] 10 . Because of the Gaussian nature of the noise model, only results from least 
squares methods are reported. The described techniques were applied on this training 
dataset and tested on an independent test set generated using the saem rules. Table 1 
reports whether the algorithm recovered the structure in the data (if so, the measure 
is 100%). The experiment using the smoothly tresholding penalized (STP) cost func- 
tion was designed as follows: for every 10 components, a version was provided for 
the algorithm for the use of a linear kernel and another for the use of a RBF kernel 
(resulting in 20 new components). The regularization scheme was able to select the 
components with the appropriate kernel (a nonlinear RBF kernel for X 1 and X 2 and 
linear ones for X 3 and X 4 ), except for one spurious component (A RBF kernel was 
selected for the fifth component). 

5.3 Classification example 

An additive model was estimated by an LS-SVM classifier based on the spam data 
as provided on the UCI benchmark repository, see e.g. (Hastie et ai, 2001). The data 



Method 


Test set Performance 


Sparse components 




L 2 


Li 


Loo 


% recovered 


LS-SVMs 


0.1110 


0.2582 


0.8743 


0% 


componentwise LS-SVMs (7) 


0.0603 


0.1923 


0.6249 


0% 


L 1 regularization (21) 


0.0624 


0.1987 


0.6601 


100% 


STP with RBF (25) 


0.0608 


0.1966 


0.6854 


100% 


STP with RBF and lin (25) 


0.0521 


0.1817 


0.5729 


95% 


Fusion with AReg (30) 


0.0614 


0.1994 


0.6634 


100% 


Fusion with comp. reg. (40) 


0.0601 


0.1953 


0.6791 


100% 



Table 1. Results on test data of numerical experiments on the Vapnik regression dataset. The 
sparseness is expressed in the rate of components which is selected only if the input is relevant 
(100% means the original structure was perfectly recovered). 



consists of word frequencies from 4601 email messages, in a study to screen email 
for spam. A test set of size 1536 was drawn randomly from the data leaving 3065 
to training purposes. The inputs were preprocessed using following transformation 
p(x) — log(l + x) and standardized to unit variance. Figure 7 gives the indicator 
functions as found using a regularization based technique to detect structure as de- 
scribed in Subsection 3.3. The structure detection algorithm selected only 6 out of the 
56 provided indicators. Moreover, the componentwise approach describes the form 
of the contribution of each indicator, resulting in an highly interpretable model. 



6 Conclusions 

This chapter describes nonlinear additive models based on LS-SVMs which are ca- 
pable of handling higher dimensional data for regression as well as classification 
tasks. The estimation stage results from solving a set of linear equations with a size 
approximatively equal to the number of training datapoints. Furthermore, the addi- 
tive regularization framework is employed for formulating dedicated regularization 
schemes leading to structure detection. Finally, a fusion argument for component 
selection and structure detection based on training componentwise LS-SVMs and 
validation performance is introduced to improve the generalization abilities of the 
method. Advantages of using componentwise LS-SVMs include the efficient esti- 
mation of additive models with respect to classical practice, interpretability of the 
estimated model, opportunities towards structure detection and the connection with 
existing statistical techniques. 
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Fig. 7. Results of the spam dataset. The non-sparse components as found by application of 
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